---
title: 'Audio Processing'
description: 'Real Time Speech Recognition Pipelines'
---

You can build real-time pipelines with Indexify that incorporate speech, and build applications that retrieve information from audio.

## Speech to Text

Audio processing pipelines generally start by converting audio to text.

* **Automatic Speech Recognition (ASR)** - Converts speech to text. If all you need is a plain transcription, you can use a model like **Whisper** in a function that
extracts transcriptions from audio. The full text and the timestamped chunks are represented as metadata of your function's output.
```python
from indexify import indexify_function

@indexify_function()
def transcribe_audio(audio: bytes) -> str:
    # Invoke an ASR model to transcribe the audio
    ...
```
* **Speaker Diarization** - Applications such as meeting transcription often require identifying who said what. Speaker diarization is the process of
segmenting the audio and clustering the segments by speaker. This can be done using a model like **pyannote.audio**.
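```python
from typing import List

from indexify import indexify_function
from pydantic import BaseModel

class SpeakerSegment(BaseModel):
    speaker: str      # Speaker label assigned by the diarization model
    transcript: str   # Text spoken in this segment
    start: float      # Segment start time, in seconds
    end: float        # Segment end time, in seconds

@indexify_function()
def diarize_audio(audio: bytes) -> List[SpeakerSegment]:
    # Invoke a speaker diarization model to segment the audio
    ...
```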
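
The two steps above can be combined into who-said-what output. The following is a minimal sketch, assuming the ASR model returns timestamped chunks and the diarization model returns speaker turns. The `assign_speakers` helper, the tuple layouts, and the `SPEAKER_00`-style labels are hypothetical, and a standard-library dataclass stands in for the Pydantic model so the snippet runs without dependencies. Each chunk is assigned to the speaker turn it overlaps most, and consecutive chunks from the same speaker are merged.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical layouts (assumptions, not an Indexify or model API):
Chunk = Tuple[str, float, float]  # (text, start_seconds, end_seconds) from ASR
Turn = Tuple[str, float, float]   # (speaker, start_seconds, end_seconds) from diarization


@dataclass
class SpeakerSegment:
    speaker: str
    transcript: str
    start: float
    end: float


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(chunks: List[Chunk], turns: List[Turn]) -> List[SpeakerSegment]:
    """Label each ASR chunk with the speaker turn it overlaps most.

    Assumes `turns` is non-empty. A chunk that overlaps no turn falls back
    to the first turn in the list.
    """
    segments: List[SpeakerSegment] = []
    for text, start, end in chunks:
        speaker = max(turns, key=lambda t: overlap(start, end, t[1], t[2]))[0]
        if segments and segments[-1].speaker == speaker:
            # Same speaker as the previous chunk: extend that segment.
            segments[-1].transcript += " " + text
            segments[-1].end = end
        else:
            segments.append(SpeakerSegment(speaker, text, start, end))
    return segments


chunks = [
    ("Hello everyone.", 0.0, 1.5),
    ("Thanks for joining.", 1.5, 3.0),
    ("Happy to be here.", 3.2, 4.5),
]
turns = [("SPEAKER_00", 0.0, 3.1), ("SPEAKER_01", 3.1, 5.0)]
segments = assign_speakers(chunks, turns)
```

In a pipeline, this logic could live inside a function such as `diarize_audio`, after both models have run.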

<Note>
You can use commercial ASR and speaker diarization services with Indexify as well. They often perform better for accented speech.
</Note>

## Speech to Speech and Text to Speech

* **Voice Interfaces** - You can build voice interfaces that take in speech and generate speech. To generate speech, use a text-to-speech (TTS) model in a function
that accepts text and returns the generated audio as a byte array.
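
As a rough illustration of the "text in, audio bytes out" shape described above, here is a minimal sketch that uses only the Python standard library. The `fake_tts_samples` function is a hypothetical placeholder that emits a tone instead of speech; a real function would call a TTS model there. The 16 kHz mono 16-bit WAV format is also an assumption, chosen only because it is easy to produce with the `wave` module.

```python
import io
import math
import struct
import wave
from typing import List

SAMPLE_RATE = 16_000  # Assumed output rate; a real TTS model defines its own


def fake_tts_samples(text: str) -> List[float]:
    """Hypothetical stand-in for a TTS model: 50 ms of a 440 Hz tone per character."""
    n = (SAMPLE_RATE // 20) * len(text)
    return [0.3 * math.sin(2 * math.pi * 440 * i / SAMPLE_RATE) for i in range(n)]


def samples_to_wav_bytes(samples: List[float], sample_rate: int = SAMPLE_RATE) -> bytes:
    """Encode float samples in [-1, 1] as a mono 16-bit PCM WAV byte string."""
    pcm = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


def text_to_speech(text: str) -> bytes:
    # In an Indexify pipeline, this function would carry the
    # @indexify_function() decorator used elsewhere on this page.
    return samples_to_wav_bytes(fake_tts_samples(text))


audio = text_to_speech("hi")
```

Returning a complete WAV container rather than raw PCM means downstream functions or clients can read the sample rate and channel count from the bytes themselves.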

## Dynamic Routing 

Speech processing is complex, and a single model often doesn't perform well across all accents and languages. You can insert a dynamic router
in your pipeline that routes audio to different ASR models by classifying the accent, language, or other features of the audio.
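
The selection logic of such a router can be sketched in plain Python. Everything named here is hypothetical: `pick_asr`, the three stub ASR functions, and `classify_language`, which is a toy classifier that reads a two-letter tag prefix that real audio would not carry. This sketch does not use Indexify's own router API; it only shows how a classification result maps to a model, with a fallback for anything unrecognized.

```python
from typing import Callable, Dict

AsrFn = Callable[[bytes], str]


def transcribe_english(audio: bytes) -> str:
    # Stand-in for an English-tuned ASR model.
    return "english-asr"


def transcribe_spanish(audio: bytes) -> str:
    # Stand-in for a Spanish-tuned ASR model.
    return "spanish-asr"


def transcribe_multilingual(audio: bytes) -> str:
    # Stand-in for a general-purpose fallback model.
    return "multilingual-asr"


def classify_language(audio: bytes) -> str:
    """Toy stand-in for a language-ID model: reads a prefix such as b"es:"."""
    tag, sep, _ = audio.partition(b":")
    if sep and len(tag) == 2:
        return tag.decode("ascii", errors="replace")
    return "unknown"


ROUTES: Dict[str, AsrFn] = {
    "en": transcribe_english,
    "es": transcribe_spanish,
}


def pick_asr(
    audio: bytes,
    classify: Callable[[bytes], str] = classify_language,
    routes: Dict[str, AsrFn] = ROUTES,
    default: AsrFn = transcribe_multilingual,
) -> AsrFn:
    """Return the ASR function for the detected label, or the default."""
    return routes.get(classify(audio), default)


asr = pick_asr(b"es:...audio bytes...")
text = asr(b"es:...audio bytes...")
```

Keeping the classifier and the routing table as parameters makes it straightforward to route on accent or other audio features instead of language.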


## Examples 
[Meeting Transcription and Summarization](https://github.com/tensorlakeai/indexify/tree/main/examples/video_summarization)
* Speaker Diarization
* Classification of meeting intent 
* Dynamic Routing between summarization models
